2022-06-04

The Algorithm

  • A high-performance modified Kneser-Ney (MKN) n-gram model as implemented by Kenneth Heafield, described in his paper “Scalable Modified Kneser-Ney Language Model Estimation”.

  • The implementation lets us control the size of the resulting model through several variables:

    • Pruning: the minimum number of times an n-gram must appear in the training set to be kept.

    • Sample size of the original data.

    • Case sensitivity: ignoring case should yield a smaller model if the other variables remain the same.

    • Ngram order: a smaller ‘n’ results in a smaller n-gram model.
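These variables map fairly directly onto a KenLM training run. A minimal sketch of how one might assemble the `lmplz` command and sample the corpus (the corpus path is illustrative, and note that lowercasing and sampling are preprocessing steps rather than `lmplz` flags; `lmplz` also does not prune unigrams, so the first `--prune` threshold is 0):

```python
import random

def build_lmplz_command(order, prune_threshold, lowercase,
                        corpus="corpus.txt", out="model.arpa"):
    """Assemble a KenLM lmplz pipeline for the given variables.

    `-o` sets the n-gram order; `--prune` takes one threshold per order
    (n-grams seen fewer than that many times are dropped). lmplz does
    not support unigram pruning, so the first threshold is always 0.
    """
    prune = " ".join(["0"] + [str(prune_threshold)] * (order - 1))
    preprocess = ("tr '[:upper:]' '[:lower:]' < {}".format(corpus)
                  if lowercase else "cat {}".format(corpus))
    return "{} | lmplz -o {} --prune {} > {}".format(preprocess, order, prune, out)

def sample_lines(lines, fraction, seed=0):
    """Take a reproducible random sample of the corpus lines."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]

print(build_lmplz_command(order=4, prune_threshold=1, lowercase=True))
```

This is only a sketch of the wiring, not the exact pipeline used for the post.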

Best model

  • We searched through 1792 models built from combinations of these variables:
    • Sample sizes from 10% to 90% of the corpus
    • Lowercased and case-preserved text
    • Ngram orders ranging from 1 to 6
    • Pruning thresholds ranging from 0 to 40
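A grid of this kind can be enumerated with `itertools.product`. The pruning steps below are illustrative only; the post does not list the exact values that produce the 1792 combinations:

```python
from itertools import product

# Illustrative grid; the actual values behind the 1792 models differ.
sample_sizes = range(10, 100, 10)   # 10% .. 90% of the corpus
casings = (True, False)             # lowercase or case-preserved
orders = range(1, 7)                # n = 1 .. 6
prunes = (0, 1, 2, 5, 10, 20, 40)   # pruning thresholds from 0 to 40

grid = list(product(sample_sizes, casings, orders, prunes))
print(len(grid))  # 9 * 2 * 6 * 7 combinations
```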

Model selection

The leftmost point is the lowest-perplexity model on this grid. Through this interactive graph you can see that the best model is a 4-gram model trained on a 90% lowercase sample with all singletons pruned.
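Selecting the model then reduces to taking the minimum-perplexity entry, optionally under a size budget. A sketch over hypothetical result records (the numbers are made up for illustration):

```python
# Hypothetical evaluation results: (perplexity, size_mb, description).
results = [
    (310.0, 620, "5-gram, cased, no pruning"),
    (295.5, 410, "4-gram, lowercase, prune=1"),
    (330.2, 150, "3-gram, lowercase, prune=5"),
]

def best_model(results, max_size_mb=None):
    """Return the lowest-perplexity model, optionally capped by size."""
    candidates = [r for r in results
                  if max_size_mb is None or r[1] <= max_size_mb]
    return min(candidates, key=lambda r: r[0])

print(best_model(results)[2])
```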

The best model

It seems we can get close to the best perplexity with a model of roughly 400MB by using almost the whole corpus (a 90% sample), lowercased, building a 4-gram language model and pruning all singletons (prune=1) at all orders.

The app

  • You can access the app here.

  • To use it, simply enter your text and a one-word completion should appear, representing the most probable continuation of the phrase.

  • An animated graph interface shows the algorithm traversing the language model in real time.
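The completion step itself amounts to looking up the most probable next word given the last n−1 words, backing off to shorter contexts when the longer one is unseen. A toy count-based sketch of that traversal (a real app would query the trained KenLM model instead):

```python
from collections import Counter, defaultdict

def train_counts(sentences, n=3):
    """Count continuations: context tuple -> Counter of next words."""
    counts = defaultdict(Counter)
    for sent in sentences:
        words = sent.lower().split()
        for i in range(len(words)):
            # Record the word under every context length 0 .. n-1.
            for k in range(min(n - 1, i), -1, -1):
                counts[tuple(words[i - k:i])][words[i]] += 1
    return counts

def complete(counts, text, n=3):
    """Back off from the longest context to shorter ones until a match."""
    words = text.lower().split()
    for k in range(n - 1, -1, -1):
        context = tuple(words[len(words) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None

corpus = ["the cat sat on the mat",
          "the cat sat quietly",
          "the cat ran away"]
model = train_counts(corpus)
print(complete(model, "the cat"))
```

The backoff loop is the part the animated graph visualizes: trying the longest matching context first, then progressively shorter ones.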